Advanced Data Wrangling
12 February, 2024
You already….
Please install and load the following packages
Access lecture slide from the course landing page
I am Ayush.
I am a researcher working at the intersection of data, law, development and economics.
I teach Data Science using R at Gokhale Institute of Politics and Economics
I am a RStudio (Posit) certified tidyverse Instructor.
I am a Researcher at Oxford Poverty and Human development Initiative (OPHI), at the University of Oxford.
Reach me
ayush.ap58@gmail.com
ayush.patel@gipe.ac.in
table1 table2 and table3, all part of tidyverse# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Source : R for Data Science 2e
table1, calculate the total number of cases per yeartable2# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
1999 and 2000 are values of the year variablecases variable1999 and 2000)year)cases)# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
pivot_wider() is the opposite of pivot_longer()# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Run the following code in your console
DIY1 already in tidy data structure? If not, how would you change it?DIY1_1 shows the grades of students in class for Periodical 1 and Periodical 2. Is the data in tidy format? If not, how would you transform it?join all these datasets together# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airlines has two variables - carrier code and name# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
weather records data about the weather at the origin airportsorigin and time_hour become the compound primary key# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
flights$tailnum is a foreign key that corresponds to the primary key planes$tailnumairports and planes data from nycflights13?flights that corresponds to the primary key in airports?Relationships and connections between all datasets in the nycflights13 package
Source : R for Data Science 2e
left_join(),inner_join(),right_join(),full_join(),semi_join() and anti_join()Run the following code chunks in your console
data2 to existing data1left_join() retains the first dataset (the one you are joining to)join_by() specifies the primary key to be used for joiningdata1 and leaves out additional rows that it does not match to from data2data1 which are not present in data2 have NA values in edu_levelright_join() does the oppositedata2 to all the variables of data1right_join() you have all the rows from data2 joined to data1data2 which are not in data1 have NA values in remaining columnsinner_join() comes to the rescuefull_join()NA values where data is missing from data1 and data2 name ID age state edu_level
1 A 193 23 MH 12
2 B 176 24 MP 8
3 C 6 32 KA 10
4 D 273 43 MP 12
5 E 132 23 UP 12
6 F 182 54 GJ 0
7 G 1 23 MH NA
8 H 42 20 UP NA
9 I 67 40 MH NA
10 J 20 53 TN NA
11 <NA> 11 NA <NA> 2
12 <NA> 13 NA <NA> 5
13 <NA> 15 NA <NA> 10
dplyr functionsnycflights13, how can you join flights and airlinesflights and planes be joined? If yes, how?flights with and planesflights and airlines such that the resulting data has three variables - carrier , name and origindata1 that have a match in data2data1 with data2flights with airlinesflights are NOT there in planes dataset? Check using an appropriate join functionLoad the following dataset
hospital_visitsname has NA valuestidyr::fill()number_of_visits column, here NA means 0 visitsdplyr::coalesce() can be used to replace themyear variablehospital_visits, we do not know what NA values in year and want to remove these observations fullydrop_na() drops NA values from the year variable name year number_of_visits
1 John 2016 8
2 <NA> 2017 10
3 <NA> 2019 NA
4 Dave 2016 10
5 <NA> 2018 NA
6 <NA> 2017 7
7 <NA> 2018 NA
drop_na()Run the following in your console
Ozone variable, replace all NA values with 0Solar.R is NAslice() functionslice() allows indexing rows by their locationsslice(1) would show the observation at the first position i.e. first observationslice_min() would give the minimum valuegroup_by , arrange() and then the slice() function would mean that it will return the first position for each groupgroup_by, it performs functions based on groupsrowwise() helps in aggregating and performing functions on rowsRun the following in your console
You want to calculate the mean income that an indivial earns from all 3 sources
This means mean of 3 rows for each individual
Take a look at the difference between the two outputs
Using only mutate() computes the mean of the three variables across all rows
id income_source1 income_source2 income_source3 mean_income
1 1 10 12 20 18.38889
2 2 15 14 25 18.38889
3 3 12 11 22 18.38889
4 4 14 16 24 18.38889
5 5 20 25 30 18.38889
6 6 18 15 28 18.38889
rowwise() gives mean of three variables for each rowrow_data %>%
rowwise() %>%
mutate(mean_income = mean(c(income_source1, income_source2, income_source3)))# A tibble: 6 × 5
# Rowwise:
id income_source1 income_source2 income_source3 mean_income
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 12 20 14
2 2 15 14 25 18
3 3 12 11 22 15
4 4 14 16 24 18
5 5 20 25 30 25
6 6 18 15 28 20.3
c_across to the rescue!# A tibble: 6 × 5
# Rowwise:
id income_source1 income_source2 income_source3 total_income
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 12 20 42
2 2 15 14 25 54
3 3 12 11 22 45
4 4 14 16 24 54
5 5 20 25 30 75
6 6 18 15 28 61
# A tibble: 6 × 5
# Rowwise:
id income_source1 income_source2 income_source3 total_income
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 10 12 20 42
2 2 15 14 25 54
3 3 12 11 22 45
4 4 14 16 24 54
5 5 20 25 30 75
6 6 18 15 28 61
yrbss data from openintro packagegrade, in order to see the first observation of one person from each gradephysically_active_7d and strength_training_7d for each individual